A New Integrated Machine Learning Approach for Web Page Categorization
Authors
Abstract
Clustering is an unsupervised task, whereas classification is supervised. In machine learning, a classifier assigns the instances of a dataset to classes after learning a model from a training dataset. The training data consists of instances labeled by a human expert; the labels are the classes into which the instances are divided, and the expert fixes them. Classification therefore requires human intervention in the form of preparing the training data. Clustering large datasets is widely regarded as a difficult problem, since it tries to group instances without the help of a human supervisor. In addition, the time complexity of algorithms such as K-Medoids is unacceptably high, even for moderately large datasets. The work reported in this paper integrates clustering and classification and tests the approach in the domain of web page categorization. Rather than using training data created by a human expert, clustering is used to prepare the training data for the classifier.
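The idea of letting clustering produce the training labels can be illustrated with a minimal sketch. The abstract does not state the features, the distance measure, or the classifier used, so everything below is an assumption made for illustration: bag-of-words vectors, cosine distance, a simple alternating k-medoids with a deterministic start, and a nearest-centroid classifier. The toy pages and all function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector for a page (assumed feature representation)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def k_medoids(vectors, k, max_iter=20):
    """Alternating k-medoids; returns (medoid indices, cluster label per vector)."""
    medoids = list(range(k))  # assumption: start from the first k pages
    labels = []
    for _ in range(max_iter):
        # Assignment step: each page joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: cosine_distance(v, vectors[medoids[c]]))
                  for v in vectors]
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties out
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(cosine_distance(vectors[i], vectors[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

def train_centroids(vectors, labels):
    """Train a nearest-centroid classifier on the cluster-assigned labels."""
    centroids = {}
    for vec, lab in zip(vectors, labels):
        centroids.setdefault(lab, Counter()).update(vec)
    return centroids

def classify(centroids, vector):
    """Predict the class whose centroid is closest to the page vector."""
    return min(centroids, key=lambda lab: cosine_distance(vector, centroids[lab]))

# Toy corpus: unlabeled pages on two topics (made up for this sketch).
pages = [
    "football match goal team score",
    "stock market shares price trading",
    "team player goal league football",
    "market investors shares bank price",
    "league match player score team",
    "bank trading stock investors market",
]
vectors = [bag_of_words(p) for p in pages]
medoids, labels = k_medoids(vectors, k=2)     # clustering replaces the human labeler
centroids = train_centroids(vectors, labels)  # classifier learns from cluster labels
new_page = bag_of_words("football team wins league match")
predicted = classify(centroids, new_page)
```

In this sketch the classifier never sees a human-assigned label; `predicted` is a cluster identifier, and mapping it to a readable category name would still need a separate step. The pairwise distance sums in the update step also show why k-medoids becomes expensive as the dataset grows.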
Similar resources
Refined and Incremental Centroid-based approach for Genre Categorization of Web pages
In this paper, I propose a refined and incremental centroid-based approach for genre categorization of web pages. My approach is based on the construction of genre centroids using a set of training web pages. These centroids will be used to classify new web pages. The originality of my approach is the implementation of two new aspects, which are refining and incrementing. My approach is based o...
Machine Learning Methods For Chinese Web Page Categorization
This paper reports our evaluation of k Nearest Neighbor (kNN), Support Vector Machines (SVM), and Adaptive Resonance Associative Map (ARAM) on Chinese web page classification. Benchmark experiments based on a Chinese web corpus showed that their predictive performance was roughly comparable, although ARAM and kNN slightly outperformed SVM in small categories. In addition, inserting rules into AR...
Anomaly-based Web Attack Detection: The Application of Deep Neural Network Seq2Seq With Attention Mechanism
Today, the use of the Internet and Internet sites has become an integral part of people's lives, and most activities and important data reside on websites. Thus, attempts to intrude into these websites have grown exponentially. Intrusion detection systems (IDS) for web attacks are one approach to protecting users. However, these systems suffer from such drawbacks as low accuracy in ...
Web Content Categorization Using Link Information
Document categorization is one of the foundational problems in (web) information retrieval. Even though web documents are hyperlinked, most proposed classification techniques take little advantage of the link structure and rely primarily on text features, as it is not immediately clear how to make link information intelligible to supervised machine learning algorithms. This paper introduces a l...
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable attackers to hide and obfuscate infectious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...